Urban Greenspace and Asthma Prevalence¶
Get Started with Big Data Pipelines¶
Vegetation has the potential to provide many ecosystem services in urban areas, such as cleaner air and water and flood mitigation. However, results are mixed on the relationship between simple measurements of vegetation cover (such as average NDVI, a measure of vegetation health) and human health. We do, however, find relationships between human health and landscape metrics that attempt to quantify the connectivity and structure of greenspace. These types of metrics include mean patch size, edge density, and fragmentation.
In this project, I calculate patch, edge, and fragmentation statistics for urban greenspace in Denver. These statistics should reflect the connectivity and spread of urban greenspace, which are important for ecosystem function and access. I then use a linear model to identify statistically significant correlations between the distribution of greenspace and health data compiled by the US Centers for Disease Control and Prevention (CDC).
Read More¶
Check out this study by Tsai et al. 2019 on the relationship between edge density and life expectancy in Baltimore, MD. The authors also discuss the influence of scale (e.g. the resolution of the imagery) on these relationships, which is important for this case study.
Full citation: Tsai, Wei-Lun, Yu-Fai Leung, Melissa R. McHale, Myron F. Floyd, and Brian J. Reich. 2019. “Relationships Between Urban Green Land Cover and Human Health at Different Spatial Resolutions.” Urban Ecosystems 22 (2): 315–24. https://doi.org/10.1007/s11252-018-0813-3.
Working with larger-than-memory (big) data¶
For this project, we are going to split up the greenspace (NDVI) data by census tract, because this matches the human health data from the CDC. If we were interested in the average NDVI at this spatial scale, we could easily use a source of multispectral data like Landsat (30m) or even MODIS (250m) without a noticeable impact on our results. However, because we need to know more about the structure of greenspace within each tract, we need higher-resolution data. For that, we will access National Agriculture Imagery Program (NAIP) data, which is collected for the continental US every few years at roughly 1m resolution. That’s enough to see individual trees and cars! The main purpose of the NAIP data is, as the name suggests, agriculture. However, it’s also a great source for urban areas, where a lot is happening at a very small scale.
The NAIP data for the City of Denver takes up about 20GB of space. This amount of data is likely to crash your kernel if you try to load it all in at once. It would also be inconvenient to store on your hard drive so that you could load it in a bit at a time for analysis. Even if you are using a computer that could handle this amount of data, imagine if you were analyzing the entire United States over multiple years!
To help with this problem, you will use cloud-based tools to calculate
your statistics instead of downloading rasters to your computer or
container. You can crop the data entirely in the cloud, thereby saving
on your memory and internet connection, using rioxarray.
STEP 1: Set up your analysis¶
As always, before you get started:
- Import necessary packages
- Create reproducible file paths for your project file structure.
- To use cloud-optimized GeoTIFFs, we recommend some settings to make sure your code does not get stopped by a momentary connection lapse – see the code cell below.
### Import libraries
import os
import pathlib
import numpy as np
import rasterio
import geopandas as gpd
import pandas as pd
import xarray
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
import time
## See Warnings
import warnings
## Add visual packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas # Plot vector data
import hvplot.xarray
from cartopy import crs as ccrs
## Data exploration and visualization packages
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
## Image Processing packages
from scipy.ndimage import convolve
from scipy.ndimage import label
## Using data in the cloud Packages
import pystac_client
## Progress Bar
from tqdm.notebook import tqdm
## OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
!pip install sodapy # To get sodapy
from sodapy import Socrata
### Create reproducible file paths
data_dir = os.path.join(pathlib.Path.home(), "Earth Data Analytics", "Spring 26", "greenspaces", "data",)
os.makedirs(data_dir, exist_ok=True)
### Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
STEP 2: Create a site map¶
We’ll be using the Centers for Disease Control and Prevention (CDC) Places dataset for human health data to compare with vegetation metrics. CDC Places also provides some modified census tracts, clipped to the city boundary, to go along with the health data. We’ll start by downloading the matching geographic data, and then select the City of Denver.
- Download the census tract Shapefile that goes along with CDC Places
- Use a row filter to select only the census tracts in Denver
- Use a conditional statement to cache your download. There is no need to cache the full dataset – stick with your pared-down version containing only Denver.
### Set up the census tract path
tract_dir = os.path.join(data_dir, "Denver_tract")
os.makedirs(tract_dir, exist_ok=True)
tract_path = os.path.join(tract_dir, "Denver_tract.shp")
### Download the census tracts (only once)
if not os.path.exists(tract_path):
    tract_url = ("https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip") # Census URL
    tract_gdf = gpd.read_file(tract_url) # Read the .shp file
    den_gdf = tract_gdf[tract_gdf.PlaceName=="Denver"] # Keep only Denver tracts
    den_gdf.to_file(tract_path, index=False) # Save this output
### Load in census tract data
den_tract_gdf = gpd.read_file(tract_path)
den_tract_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | |
|---|---|---|---|---|---|---|---|
| 0 | 0820000 | 08031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... |
| 1 | 0820000 | 08031000201 | 08 | Denver | 0820000-08031000201 | 3874 | POLYGON ((-11688301.532 4835632.272, -11688302... |
| 2 | 0820000 | 08031000202 | 08 | Denver | 0820000-08031000202 | 3916 | POLYGON ((-11688362.201 4834372.228, -11688360... |
| 3 | 0820000 | 08031000301 | 08 | Denver | 0820000-08031000301 | 5003 | POLYGON ((-11691355.36 4833538.467, -11691357.... |
| 4 | 0820000 | 08031000302 | 08 | Denver | 0820000-08031000302 | 4036 | POLYGON ((-11692926.635 4832494.047, -11692925... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | 0820000 | 08031015500 | 08 | Denver | 0820000-08031015500 | 3313 | MULTIPOLYGON (((-11681197.008 4821951.258, -11... |
| 140 | 0820000 | 08031015600 | 08 | Denver | 0820000-08031015600 | 6498 | POLYGON ((-11688537.194 4817797.736, -11688985... |
| 141 | 0820000 | 08031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... |
| 142 | 0820000 | 08031980000 | 08 | Denver | 0820000-08031980000 | 1165 | POLYGON ((-11646235.781 4840781.884, -11646415... |
| 143 | 0820000 | 08031980100 | 08 | Denver | 0820000-08031980100 | 0 | POLYGON ((-11671535.922 4836666.59, -11671535.... |
144 rows × 7 columns
### Site plot -- Census tracts with satellite imagery in the background
(
den_tract_gdf
## Set projection
.to_crs(ccrs.Mercator())
## Plot with satellite imagery in the background
.hvplot(
line_color = 'orange', fill_color = None,
crs = ccrs.Mercator(), tiles = 'EsriImagery',
frame_width = 600
)
)
What do you notice about the City of Denver from the coarse satellite image? Is greenspace evenly distributed? What can you learn about Denver from websites, scientific papers, or other resources that might help explain what you see in the site map?
In this map, we can clearly see significant greenspace along the eastern border of Denver. Notably, there is much less greenspace in the central area and along the highway corridor. We would expect the corridor around I-25 to lack greenspace due to the built-up infrastructure there. It is also important to note that the census tract in the northeast corner of the map corresponds to Denver International Airport and is not heavily populated. Another important feature of this area is the Rocky Mountain Arsenal, also in the northeast corner.
It's still a little difficult to identify how much greenspace exists from satellite images alone, so I'm looking forward to diving into it more.
STEP 3 - Access Asthma and Urban Greenspaces Data¶
Step 3a - Access human health data¶
The U.S. Centers for Disease Control and Prevention (CDC) provides a number of health variables through their Places dataset that might be correlated with urban greenspace. For this assignment, start with adult asthma. Try to limit the data as much as possible for download. Selecting the state and county is one way to do this.
- You can access Places data with an API, but as with many APIs it is easier to test out your search before building a URL. Navigate to the Places Census Tract Data Portal and search for the data you want.
- The data portal will make an API call for you, but there is a simpler way to form an API call that is easier to read and modify. Check out the Socrata documentation to see how. You can also find some limited examples and a list of available parameters for this API in the CDC Places SODA Consumer API documentation.
- Once you have formed your query, you may notice that you have exactly 1000 rows. The Places SODA API limits you to 1000 records in a download. Either narrow your search or use the $limit parameter to increase the number of rows downloaded. You can find more information on the Paging page of the SODA API documentation.
- You should also clean up this data by renaming the 'data_value' column to something descriptive, and possibly selecting a subset of columns.
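The notebook imports Socrata from sodapy but the download below uses a raw CSV URL. As an alternative, here is a hedged sketch of the sodapy route. The field names match the cwsq-ngmh dataset used below, but the records themselves are mocked so the example runs without a network connection; the commented-out client call is the assumed live equivalent.

```python
import pandas as pd

## Mock records shaped like the Places SODA API response
## (field names from the cwsq-ngmh dataset; the values are invented)
mock_records = [
    {"year": "2023", "locationname": "08031000102", "data_value": "10.2",
     "low_confidence_limit": "9.1", "high_confidence_limit": "11.4"},
    {"year": "2023", "locationname": "08031000201", "data_value": "9.8",
     "low_confidence_limit": "8.6", "high_confidence_limit": "10.9"},
]

## With a live connection, roughly the same records could come from sodapy:
# from sodapy import Socrata
# client = Socrata("data.cdc.gov", None)
# mock_records = client.get("cwsq-ngmh", year=2023, stateabbr="CO",
#                           countyname="Denver", measureid="CASTHMA", limit=1500)

## Clean up: rename columns and convert the measure to a float
asthma_df = (
    pd.DataFrame.from_records(mock_records)
    .rename(columns={
        "data_value": "asthma",
        "low_confidence_limit": "asthma_ci_low",
        "high_confidence_limit": "asthma_ci_high",
        "locationname": "tract",
    })
    .astype({"asthma": "float64"})
)
```

One advantage of the client route is that sodapy returns records as a list of dicts, so no URL string assembly is needed.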
### Set up a path for the asthma data
cdc_path = os.path.join(data_dir, 'asthma-den1.csv')
### Download asthma data (only once)
if not os.path.exists(cdc_path):
    print("Working...")
    cdc_url = (
        ## url for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"
        ## parameters to filter on, selecting data for asthma rates in Denver in 2023
        "?year=2023"
        "&stateabbr=CO"
        "&countyname=Denver"
        "&measureid=CASTHMA"
        "&$limit=1500"
    )
    cdc_df = (
        ## open it as a CSV
        pd.read_csv(cdc_url)
        ## Rename columns
        .rename(columns = {
            'data_value': 'asthma',
            'low_confidence_limit': 'asthma_ci_low',
            'high_confidence_limit': 'asthma_ci_high',
            'locationname': 'tract'})
        ## Select columns to keep
        [[
            'year', 'tract',
            'asthma', 'asthma_ci_low', 'asthma_ci_high',
            'data_value_unit', 'totalpopulation',
            'totalpop18plus'
        ]]
    )
    ## Save to a CSV
    cdc_df.to_csv(cdc_path, index = False)
### Load in asthma data
cdc_df = pd.read_csv(cdc_path)
### Preview asthma data
cdc_df
| year | tract | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 8031000301 | 9.9 | 8.9 | 11.1 | % | 5779 | 4885 |
| 1 | 2023 | 8031000401 | 10.4 | 9.2 | 11.7 | % | 3174 | 2735 |
| 2 | 2023 | 8031000201 | 9.8 | 8.6 | 10.9 | % | 3913 | 3222 |
| 3 | 2023 | 8031000705 | 10.5 | 9.4 | 11.7 | % | 3134 | 2414 |
| 4 | 2023 | 8031001102 | 10.0 | 8.9 | 11.2 | % | 5274 | 4868 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 171 | 2023 | 8031005104 | 10.4 | 9.3 | 11.6 | % | 3470 | 2956 |
| 172 | 2023 | 8031015300 | 10.2 | 9.1 | 11.4 | % | 3697 | 3306 |
| 173 | 2023 | 8031006903 | 10.4 | 9.4 | 11.6 | % | 2475 | 2105 |
| 174 | 2023 | 8031006816 | 10.7 | 9.6 | 12.0 | % | 5511 | 4570 |
| 175 | 2023 | 8031006902 | 11.0 | 9.9 | 12.2 | % | 3638 | 3032 |
176 rows × 8 columns
Step 3b - Join health data with census tract boundaries¶
- Join the census tract GeoDataFrame with the asthma prevalence DataFrame using the .merge() method.
- You will need to change the data type of one of the merge columns so they match, e.g. using .astype('int64').
- There are a few census tracts in Denver that do not have data. You should be able to confirm that they are not listed through the interactive Places Data Portal. However, if you have large chunks of the city missing, it may mean that you need to expand the record limit for your download.
### Change tract identifier datatype for merging
den_tract_gdf.tract2010 = den_tract_gdf.tract2010.astype('int64')
### Merge census data with geometry
tract_cdc_gdf = (
    den_tract_gdf
    .merge(
        cdc_df,
        ## specify columns to merge on
        left_on = 'tract2010',
        right_on = 'tract',
        ## use inner join to keep only tracts with both geometry and data
        how = 'inner'
    )
)
tract_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0820000 | 8031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... | 2023 | 8031000102 | 10.2 | 9.1 | 11.4 | % | 3622 | 3017 |
| 1 | 0820000 | 8031000201 | 08 | Denver | 0820000-08031000201 | 3874 | POLYGON ((-11688301.532 4835632.272, -11688302... | 2023 | 8031000201 | 9.8 | 8.6 | 10.9 | % | 3913 | 3222 |
| 2 | 0820000 | 8031000202 | 08 | Denver | 0820000-08031000202 | 3916 | POLYGON ((-11688362.201 4834372.228, -11688360... | 2023 | 8031000202 | 10.2 | 9.1 | 11.4 | % | 4042 | 3206 |
| 3 | 0820000 | 8031000301 | 08 | Denver | 0820000-08031000301 | 5003 | POLYGON ((-11691355.36 4833538.467, -11691357.... | 2023 | 8031000301 | 9.9 | 8.9 | 11.1 | % | 5779 | 4885 |
| 4 | 0820000 | 8031000302 | 08 | Denver | 0820000-08031000302 | 4036 | POLYGON ((-11692926.635 4832494.047, -11692925... | 2023 | 8031000302 | 10.0 | 8.9 | 11.2 | % | 4412 | 3731 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | 0820000 | 8031015300 | 08 | Denver | 0820000-08031015300 | 3832 | POLYGON ((-11679896.574 4824084.018, -11679896... | 2023 | 8031015300 | 10.2 | 9.1 | 11.4 | % | 3697 | 3306 |
| 108 | 0820000 | 8031015400 | 08 | Denver | 0820000-08031015400 | 3934 | POLYGON ((-11691351.798 4835608.508, -11691351... | 2023 | 8031015400 | 10.6 | 9.5 | 11.9 | % | 4487 | 3795 |
| 109 | 0820000 | 8031015500 | 08 | Denver | 0820000-08031015500 | 3313 | MULTIPOLYGON (((-11681197.008 4821951.258, -11... | 2023 | 8031015500 | 10.8 | 9.6 | 12.1 | % | 3399 | 3019 |
| 110 | 0820000 | 8031015600 | 08 | Denver | 0820000-08031015600 | 6498 | POLYGON ((-11688537.194 4817797.736, -11688985... | 2023 | 8031015600 | 9.8 | 8.7 | 10.8 | % | 6618 | 4813 |
| 111 | 0820000 | 8031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... | 2023 | 8031015700 | 10.1 | 9.0 | 11.2 | % | 5907 | 4625 |
112 rows × 15 columns
### Plot asthma data as choropleth
(
## load basemap with satellite imagery
gv.tile_sources.EsriImagery
*
## map census tracts with asthma data
gv.Polygons(
## reproject to align CRSs
tract_cdc_gdf.to_crs(ccrs.Mercator()),
## select cols
vdims = ['asthma', 'tract2010'],
## ensure CRSs align
crs = ccrs.Mercator()
).opts(color = 'asthma', colorbar = True, colorbar_opts = {'title':'Percent Asthma Rate'}, title = 'Adult Asthma Rates across Census Tracts', tools = ['hover'])
).opts(width = 600, height = 600, xaxis = None, yaxis = None)
Write a description and citation for the asthma prevalence data. Do you notice anything about the spatial distribution of asthma in Denver? From your research on the city, what might be some potential causes of any patterns you see?
Looking at the prevalence of asthma on the CDC website, we can see that some areas of Denver are above the national average: the national asthma prevalence in 2022 was 8.2%. Much of Denver falls near this value, but there are some areas (in the north and east) where the rate is over 10% and closer to 12-14%. This could be due to significant industrial manufacturing in the area as well as urbanization and a lack of greenspace. It is also interesting to consider whether proximity to not just highway infrastructure but also airplane traffic or industrial pollution (Commerce City) plays a role. There is another tract with a very high asthma rate in central Denver; it would be interesting to see what other demographic variables in that region play a role.
Step 3c - Get data URLs for urban greenspace imagery¶
NAIP data are freely available through the Microsoft Planetary Computer SpatioTemporal Asset Catalog (STAC).
Get started with STAC by accessing the planetary computer catalog with the following code:
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

### Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
Using a loop, for each census tract:
- Use the following sample code to search for data, replacing the placeholder names with descriptive values:

  search = e84_catalog.search(
      collections=["naip"],
      intersects=shapely.to_geojson(tract_geometry),
      datetime=f"{year}"
  )

- Access the URL for each item using item.assets['image'].href
- Accumulate the URLs in a pd.DataFrame or dict for later
- Occasionally you may find that the STAC service is momentarily unavailable. You should include code that will retry the request up to 5 times when you get a pystac_client.exceptions.APIError.
Warning
As always – DO NOT try to write this loop all at once! Stick with one step at a time, making sure to test your work. You also probably want to add a break into your loop to stop it after a single iteration. This will help prevent long waits during debugging.
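The retry advice above can also be factored into a small reusable helper. This is only a sketch, not the notebook's approach (the loop below retries inline); the flaky function here is a hypothetical stand-in for the STAC search that fails twice before succeeding, so the example runs without a network connection.

```python
import time

def with_retries(func, n_tries=5, delay=2, exceptions=(Exception,)):
    """Call func(), retrying up to n_tries times on the given exceptions."""
    for attempt in range(n_tries):
        try:
            return func()
        except exceptions:
            if attempt == n_tries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(delay)

## Hypothetical stand-in for the STAC search; in the real loop the
## exception to catch would be pystac_client.exceptions.APIError
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("momentary outage")
    return "ok"

result = with_retries(flaky, n_tries=5, delay=0, exceptions=(ConnectionError,))
```

In the real loop you would wrap the e84_catalog.search(...) call and catch pystac_client.exceptions.APIError instead of ConnectionError.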
### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)
### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'den-ndvi-stats3.csv')
### Check for existing data - do not access duplicate tracts
## Initialize an empty list for the census tract IDs
downloaded_tracts = []
## write conditional
if os.path.exists(ndvi_stats_path):
    ## if it exists, open the csv in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)
    ## fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values
## if it doesn't exist, print statement
else:
    print('No census tracts downloaded so far.')
### Loop through each census tract
## Make list to accumulate data
scene_dfs = []
## Loop through each tract value in the gdf
for i, tract_values in tqdm(tract_latlong_gdf.iterrows()): # progress bar
    ## for each row, extract the tract id
    tract = tract_values.tract2010
    ### Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):
        ### Repeat up to 5 times in case of a momentary disruption
        attempt = 0 ## retry counter (avoid reusing the loop variable i)
        retry_limit = 5 ## max number of tries
        while attempt < retry_limit:
            ### Try accessing the STAC
            try:
                ### Search for tiles
                naip_search = e84_catalog.search(
                    collections = ['naip'],
                    intersects = shapely.to_geojson(tract_values.geometry),
                    datetime = "2021"
                )
                ### Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(
                    ## column called tract
                    tract = tract,
                    date = [pd.to_datetime(scene.datetime).date()
                            for scene in naip_search.items()], # each scene is one NAIP image
                    rgbir_href = [scene.assets['image'].href
                                  for scene in naip_search.items()],
                )))
                ## Exit retry loop once STAC request succeeds
                break
            ### Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with STAC server. '
                    f'Retrying tract {tract}...')
                ## Wait 2 seconds before retrying
                time.sleep(2)
                attempt += 1
                continue
### Concatenate the url dataframes
if scene_dfs:
    ## concatenate
    scene_df = pd.concat(scene_dfs).reset_index(drop = True)
## Otherwise, leave it empty
else:
    scene_df = None
### Preview the url dataframe
scene_df
No census tracts downloaded so far.
0it [00:00, ?it/s]
| tract | date | rgbir_href | |
|---|---|---|---|
| 0 | 8031000102 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1 | 8031000201 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 2 | 8031000201 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 3 | 8031000202 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 4 | 8031000202 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| ... | ... | ... | ... |
| 180 | 8031015500 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 181 | 8031015500 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 182 | 8031015600 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 183 | 8031015600 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 184 | 8031015700 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
185 rows × 3 columns
print(scene_df)
tract date rgbir_href 0 8031000102 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 1 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 2 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 3 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 4 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... .. ... ... ... 180 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 181 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 182 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 183 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... 184 8031015700 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/... [185 rows x 3 columns]
## Write out scene_df as a csv
scene_df.to_csv(ndvi_stats_path, index = False)
Step 3d - Compute NDVI Statistics¶
Next, calculate some metrics to get at different aspects of the
distribution of greenspace in each census tract. These types of
statistics are called fragmentation statistics, and can be
implemented with the scipy package. Some examples of fragmentation
statistics are:
Percentage vegetation: The percentage of pixels that exceed a vegetation threshold (0.12 is common with Landsat)
Mean patch size: The average size of patches, or contiguous area exceeding the
vegetation threshold. Patches can be identified with the label
function from scipy.ndimage
Edge density: The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.
What is convolution?
If you are familiar with differential equations: convolving with the kernel below computes a discrete approximation of the Laplace operator (Laplacian), which measures how much each pixel differs from its surroundings.
For the purposes of calculating edge density, convolution means that we take every possible 3x3 chunk of our image and multiply it elementwise by the kernel, summing the result:
$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$
The result is a matrix the same size as the original, minus the outermost edge. If the center pixel is the same as the surroundings, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.
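The three statistics described above can be sketched end-to-end on a tiny synthetic grid. The array values and the 0.3 vegetation threshold below are invented for illustration (the real analysis uses NAIP-derived NDVI and its own threshold); the kernel is the one from the text.

```python
import numpy as np
from scipy.ndimage import convolve, label

## Tiny synthetic NDVI grid -- values invented for illustration
ndvi = np.array([
    [0.8, 0.8, 0.0, 0.0, 0.7],
    [0.8, 0.8, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.6, 0.6],
    [0.6, 0.0, 0.0, 0.6, 0.6],
])

## Binary vegetation mask (0.3 is an assumed threshold)
veg_mask = ndvi > 0.3

## Fraction vegetation: share of pixels above the threshold
frac_veg = veg_mask.mean()

## Mean patch size: label contiguous patches (4-connectivity by default)
labeled, n_patches = label(veg_mask)
mean_patch_size = veg_mask.sum() / n_patches

## Edge density: convolve with the Laplacian-style kernel from the text;
## vegetated pixels that differ from their neighborhood get a negative response
kernel = np.array([[1,  1, 1],
                   [1, -8, 1],
                   [1,  1, 1]])
response = convolve(veg_mask.astype(int), kernel, mode="constant", cval=0)
edge_density = ((response < 0) & veg_mask).sum() / veg_mask.sum()
```

On this toy grid there are four patches of 12 vegetated pixels total, and every vegetated pixel is an edge pixel because the patches are so small; on a real 1m NDVI raster, patch interiors would pull edge density well below 1.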
- Select a single row from the census tract GeoDataFrame using e.g. .loc[[0]], then select all the rows from your URL DataFrame that match the census tract.
- For each URL: crop, merge, and clip the imagery, then compute NDVI for that census tract.
- Set a threshold to get a binary mask of vegetation.
- Use the sample code to compute the fragmentation statistics. Feel free to add any other statistics you think are relevant, but make sure to include fraction vegetation, mean patch size, and edge density. If you are not sure what any line of code is doing, make a plot or print something to find out! You can also ask ChatGPT or the class.
## Open df of NAIP imagery URLs
scene_df = pd.read_csv(ndvi_stats_path)
scene_df
| tract | date | rgbir_href | |
|---|---|---|---|
| 0 | 8031000102 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1 | 8031000201 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 2 | 8031000201 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 3 | 8031000202 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 4 | 8031000202 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| ... | ... | ... | ... |
| 180 | 8031015500 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 181 | 8031015500 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 182 | 8031015600 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 183 | 8031015600 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
| 184 | 8031015700 | 2021-07-28 | https://naipeuwest.blob.core.windows.net/naip/... |
185 rows × 3 columns
## Pull out first tract value
tract = den_tract_gdf.loc[0].tract2010
## Convert to string from integer
tract = str(tract)
tract
'8031000102'
href_value = scene_df.loc[
scene_df.tract.astype(str) == tract,
## column to extract
"rgbir_href"
].iloc[0] ## Just grab 1
href_value
'https://naipeuwest.blob.core.windows.net/naip/v002/co/2021/co_060cm_2021/39105/16/m_3910516_se_13_060_20210728.tif'
## Process 1 scene
## Open the raster
tile_da = rxr.open_rasterio(
    ## give it the URL
    href_value,
    ## mask out nodata pixels
    masked = True
    ## Remove dimensions of length 1
).squeeze()
## Check it out
tile_da
<xarray.DataArray (band: 4, y: 12230, x: 9600)> Size: 2GB
[469632000 values with dtype=float32]
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 98kB 4.407e+06 4.407e+06 ... 4.4e+06 4.4e+06
* x (x) float64 77kB 4.944e+05 4.944e+05 ... 5.002e+05 5.002e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_IMAGEDESCRIPTION: OrthoVista
TIFFTAG_SOFTWARE: Trimble Germany GmbH
TIFFTAG_XRESOLUTION: 1
TIFFTAG_YRESOLUTION: 1
TIFFTAG_RESOLUTIONUNIT: 1 (unitless)
AREA_OR_POINT: Area
scale_factor: 1.0
add_offset: 0.0
## Check out dimensions
tile_da.dims
('band', 'y', 'x')
## Reminder what is in columns
print(scene_df.columns)
Index(['tract', 'date', 'rgbir_href'], dtype='object')
## Check out the data
tract_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0820000 | 8031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... | 2023 | 8031000102 | 10.2 | 9.1 | 11.4 | % | 3622 | 3017 |
| 1 | 0820000 | 8031000201 | 08 | Denver | 0820000-08031000201 | 3874 | POLYGON ((-11688301.532 4835632.272, -11688302... | 2023 | 8031000201 | 9.8 | 8.6 | 10.9 | % | 3913 | 3222 |
| 2 | 0820000 | 8031000202 | 08 | Denver | 0820000-08031000202 | 3916 | POLYGON ((-11688362.201 4834372.228, -11688360... | 2023 | 8031000202 | 10.2 | 9.1 | 11.4 | % | 4042 | 3206 |
| 3 | 0820000 | 8031000301 | 08 | Denver | 0820000-08031000301 | 5003 | POLYGON ((-11691355.36 4833538.467, -11691357.... | 2023 | 8031000301 | 9.9 | 8.9 | 11.1 | % | 5779 | 4885 |
| 4 | 0820000 | 8031000302 | 08 | Denver | 0820000-08031000302 | 4036 | POLYGON ((-11692926.635 4832494.047, -11692925... | 2023 | 8031000302 | 10.0 | 8.9 | 11.2 | % | 4412 | 3731 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | 0820000 | 8031015300 | 08 | Denver | 0820000-08031015300 | 3832 | POLYGON ((-11679896.574 4824084.018, -11679896... | 2023 | 8031015300 | 10.2 | 9.1 | 11.4 | % | 3697 | 3306 |
| 108 | 0820000 | 8031015400 | 08 | Denver | 0820000-08031015400 | 3934 | POLYGON ((-11691351.798 4835608.508, -11691351... | 2023 | 8031015400 | 10.6 | 9.5 | 11.9 | % | 4487 | 3795 |
| 109 | 0820000 | 8031015500 | 08 | Denver | 0820000-08031015500 | 3313 | MULTIPOLYGON (((-11681197.008 4821951.258, -11... | 2023 | 8031015500 | 10.8 | 9.6 | 12.1 | % | 3399 | 3019 |
| 110 | 0820000 | 8031015600 | 08 | Denver | 0820000-08031015600 | 6498 | POLYGON ((-11688537.194 4817797.736, -11688985... | 2023 | 8031015600 | 9.8 | 8.7 | 10.8 | % | 6618 | 4813 |
| 111 | 0820000 | 8031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... | 2023 | 8031015700 | 10.1 | 9.0 | 11.2 | % | 5907 | 4625 |
112 rows × 15 columns
tract_cdc_gdf["tract2010"].head()
0 8031000102 1 8031000201 2 8031000202 3 8031000301 4 8031000302 Name: tract2010, dtype: int64
tract_cdc_gdf["tract2010"].dtype
dtype('int64')
tract in tract_cdc_gdf["tract2010"].astype(str).values
True
## Get boundary of the tract we're working with
boundary = (
tract_cdc_gdf
## Deal with issues with integers
.assign(tract2010 = lambda df: df["tract2010"].astype(str))
## Use tract ID as the index
.set_index("tract2010")
## Grab the tract we're using
.loc[[str(tract)]]
## Make sure the CRS match
.to_crs(tile_da.rio.crs)
## Grab geometry
.geometry
)
boundary
tract2010 8031000102 POLYGON ((497842.209 4403809.736, 497842.293 4... Name: geometry, dtype: geometry
## Crop to bounding box
crop_da = tile_da.rio.clip_box(
## Get coordinates of the bounding box
*boundary.envelope.total_bounds,
## Let it expand a little
auto_expand = True
)
crop_da
<xarray.DataArray (band: 4, y: 1562, x: 4006)> Size: 100MB
[25029488 values with dtype=float32]
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
* x (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_IMAGEDESCRIPTION: OrthoVista
TIFFTAG_SOFTWARE: Trimble Germany GmbH
TIFFTAG_XRESOLUTION: 1
TIFFTAG_YRESOLUTION: 1
TIFFTAG_RESOLUTIONUNIT: 1 (unitless)
AREA_OR_POINT: Area
scale_factor: 1.0
add_offset: 0.0
## Clip to exact geometry of the census tract
clip_da = crop_da.rio.clip(
boundary,
## Include all pixels touching the boundary
all_touched = True)
clip_da
<xarray.DataArray (band: 4, y: 1562, x: 4006)> Size: 100MB
array([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]],
shape=(4, 1562, 4006), dtype=float32)
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
* x (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_IMAGEDESCRIPTION: OrthoVista
TIFFTAG_SOFTWARE: Trimble Germany GmbH
TIFFTAG_XRESOLUTION: 1
TIFFTAG_YRESOLUTION: 1
TIFFTAG_RESOLUTIONUNIT: 1 (unitless)
AREA_OR_POINT: Area
scale_factor: 1.0
add_offset:                  0.0
clip_da.shape
(4, 1562, 4006)
NAIP / NDVI Notes
Band 1 is Visible Red; Band 2 is Visible Green; Band 3 is Visible Blue; Band 4 is Near-Infrared (NIR)
NDVI = (NIR - Red) / (NIR + Red)
- -1 to 0: dead plants or inanimate objects
- 0 to 0.33: unhealthy plants
- 0.33 to 0.66: moderately healthy plants
- 0.66 to 1: very healthy plants
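As a quick sanity check, the formula and thresholds above can be applied to a few hypothetical reflectance values (toy numbers, not NAIP data):

```python
import numpy as np

# Hypothetical red and NIR reflectance for four pixels
red = np.array([0.30, 0.20, 0.10, 0.05])
nir = np.array([0.10, 0.30, 0.40, 0.60])

# NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir - red) / (nir + red)

# Classify with the thresholds listed above
labels = np.select(
    [ndvi < 0, ndvi < 0.33, ndvi < 0.66],
    ["inanimate", "unhealthy", "moderately healthy"],
    default="very healthy",
)
print(ndvi.round(2), labels)
```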
## Do NDVI math
ndvi_da = (
(clip_da.sel(band = 4) - clip_da.sel(band = 1))
/ (clip_da.sel(band = 4) + clip_da.sel(band = 1))
)
ndvi_da
<xarray.DataArray (y: 1562, x: 4006)> Size: 25MB
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
shape=(1562, 4006), dtype=float32)
Coordinates:
* y (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
* x (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_IMAGEDESCRIPTION: OrthoVista
TIFFTAG_SOFTWARE: Trimble Germany GmbH
TIFFTAG_XRESOLUTION: 1
TIFFTAG_YRESOLUTION: 1
TIFFTAG_RESOLUTIONUNIT: 1 (unitless)
AREA_OR_POINT: Area
scale_factor: 1.0
add_offset:                  0.0
print(scene_df)
          tract        date                                         rgbir_href
0    8031000102  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
1    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
2    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
3    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
4    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
..          ...         ...                                                ...
180  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
181  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
182  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
183  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
184  8031015700  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
[185 rows x 3 columns]
### Skip this step if data are already downloaded
if scene_df is not None: ## conditional
### Get an example
## Pull out the identifier for an example tract
tract = den_tract_gdf.loc[0].tract2010
## subset scene_df to only include that tract
ex_scene_gdf = scene_df[scene_df.tract == tract]
### Loop through all images for tract
## Make a list to accumulate into
tile_das = [] ## initiate an empty list
## Loop through all the images for this tract
for _, href_s in ex_scene_gdf.iterrows():
### Open vsi connection to data
tile_da = rxr.open_rasterio(
##location of the raster
href_s.rgbir_href,
## deal with no data, remove extra dimensions
masked = True).squeeze()
### Crop data to the bounding box of the census tract
## Make the bounding box
boundary = (
tract_cdc_gdf
## Use tract ID as index
.set_index('tract2010')
## Select geometry for the tract
.loc[[tract]]
# Reproject
.to_crs(tile_da.rio.crs)
# Extract Geometry
.geometry
)
## Crop to bounding box
crop_da = tile_da.rio.clip_box(
## Get coordinates for the bounding box
*boundary.envelope.total_bounds,
## let expand
auto_expand = True
)
### Clip data to the boundary of the census tract
clip_da = crop_da.rio.clip(
boundary, all_touched = True)
### Compute NDVI
ndvi_da = (
(clip_da.sel(band = 4) - clip_da.sel(band = 1)) /
(clip_da.sel(band = 4) + clip_da.sel(band = 1))
)
### Accumulate result
tile_das.append(ndvi_da)
## Check
print(f"Number of tiles in tiles_da: {len(tile_das)}")
### Merge data into a single object
scene_da = rxrmerge.merge_arrays(tile_das)
### Mask vegetation
veg_mask = (scene_da > .3) # Greater than moderately healthy plants
## Count Patch pixels
### Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()
## Check it out
patch_sizes
Number of tiles in tiles_da: 1
array([23, 62, 1, ..., 6, 2, 1], shape=(8043,))
mean_patch_size
np.float64(208.43590699987567)
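The `label`/`bincount` pattern used above can be verified on a tiny toy mask (values invented for illustration):

```python
import numpy as np
from scipy.ndimage import label

# Tiny binary vegetation mask with two separate patches
veg = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
])

# label() assigns each connected patch its own integer id (background is 0)
labeled, num = label(veg)

# bincount counts pixels per id; [1:] drops the background count
sizes = np.bincount(labeled.ravel())[1:]
print(num, sizes.tolist(), sizes.mean())  # 2 [3, 3] 3.0
```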
### Make kernel to calculate edge density
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1]])
### Calculate edge density
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
edge_density
np.float64(0.14472529362166736)
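To see why this kernel measures edges, apply it to a toy mask: for an interior vegetation pixel the eight neighbors sum to 8, canceling the -8 at the center, so only pixels near the vegetation boundary produce nonzero values. A sketch with a hypothetical 5×5 mask:

```python
import numpy as np
from scipy.ndimage import convolve

kernel = np.array([
    [1, 1, 1],
    [1, -8, 1],
    [1, 1, 1]])

# A solid 3x3 vegetation block in a 5x5 scene
mask = np.zeros((5, 5), dtype=int)
mask[1:4, 1:4] = 1

edges = convolve(mask, kernel, mode='constant')

# Only the fully-interior pixel (2, 2) cancels to zero
edge_density = np.sum(edges != 0) / mask.size
print(edges[2, 2], edge_density)  # 0 0.96
```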
Repeat for all tracts¶
Using a loop, for each Census Tract:
- Using a loop, for each data URL:
  - Use `rioxarray` to open up a connection to the STAC asset, just like you would a file on your computer
  - Crop and then clip your data to the census tract boundary
    > HINT: check out the `.clip_box` parameter `auto_expand` and the `.clip` parameter `all_touched` to make sure you don't end up with an empty array
  - Compute NDVI for the tract
- Merge the NDVI rasters
Compute:
- total number of pixels within the tract
- fraction of pixels with an NDVI greater than .12 within the tract (and any other statistics you would like to look at)
Accumulate the statistics in a file for later
Using a conditional statement, ensure that you do not run this computation if you have already saved values. You do not want to run this step many times, or have to restart from scratch! There are many approaches to this, but we actually recommend implementing your caching in the previous cell when you generate your dataframe of URLs, since that step can take a few minutes as well. However, the important thing to cache is the computation.
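One minimal version of this caching pattern (a sketch only; the path and placeholder row are hypothetical, and in this notebook the real file is `ndvi_stats_path_veg`):

```python
import os
import tempfile

import pandas as pd

# Hypothetical cache location for the tract statistics
stats_path = os.path.join(tempfile.gettempdir(), 'ndvi-stats-demo.csv')

if os.path.exists(stats_path):
    # Cached: skip the expensive computation and reload saved results
    ndvi_stats_df = pd.read_csv(stats_path)
else:
    # Not cached: run the computation (placeholder row here) and save it
    ndvi_stats_df = pd.DataFrame(dict(tract=['8031000102'], frac_veg=[0.32]))
    ndvi_stats_df.to_csv(stats_path, index=False)
```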
## Make CSV for NDVI data
ndvi_stats_path_veg = os.path.join(data_dir, 'den-ndvi-stats-veg.csv')
### Skip this step if data are already downloaded
if scene_df is not None:
### Loop through the census tracts with URLs
for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):
### Open all images for tract
## Create a list to store NDVI arrays, with 1 array per NAIP image
tile_das = []
## iterate over rows in tract-specific data frame
for _, href_s in tract_date_gdf.iterrows():
### Open vsi connection to data
tile_da = rxr.open_rasterio(
## mask and squeeze
href_s.rgbir_href, masked = True).squeeze()
### Clip data to the tract geometry
## Get tract boundary
boundary = (
tract_cdc_gdf
.set_index('tract2010') # set the index to the 2010 tract ID
.loc[[tract]] # grab the row for that tract
.to_crs(tile_da.rio.crs) # reproject to align coordinates
.geometry # grab geometry only
)
## Crop the raster to the bounding box
crop_da = tile_da.rio.clip_box(
*boundary.envelope.total_bounds,
auto_expand = True # ensure that we're getting everything on the boundary
)
## Clip to actual tract geometry
clip_da = crop_da.rio.clip(boundary, all_touched = True)
### Compute NDVI
ndvi_da = (
(clip_da.sel(band = 4) - clip_da.sel(band = 1)) /
(clip_da.sel(band = 4) + clip_da.sel(band = 1))
)
### Accumulate result
tile_das.append(ndvi_da)
### Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
### Mask vegetation
veg_mask = (scene_da > .3)
### Calculate statistics and save data to file
## Count all valid pixels in the tract
total_pixels = scene_da.notnull().sum()
## Count all vegetated pixels
veg_pixels = veg_mask.sum()
### Identify vegetation patches
labeled_patches, num_patches = label(veg_mask) # label each patch to keep track of them
## Count patch pixels
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
## Calculate mean patch size
mean_patch_size = patch_sizes.mean() # Take mean of all patch sizes in that tract
### Calculate edge density
## Define a kernel
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1]])
## Detect boundaries between veg and non-veg
edges = convolve(veg_mask, kernel, mode = 'constant')
## Calculate proportion of edge pixels relative to tract area
edge_density = np.sum(edges != 0) / veg_mask.size
### Add a row to the statistics file for this tract
## Create new dataframe for the tract and append it to the csv
pd.DataFrame(dict( # Create a dictionary, then turn it into a dataframe using pandas
## Store tract ID
tract = [tract],
## Store total pixels
total_pixels = [int(total_pixels)],
## Store fraction of veg pixels
frac_veg = [float(veg_pixels / total_pixels)],
## Store mean patch size
mean_patch_size = [mean_patch_size],
## Store edge density
edge_density = [edge_density]
## Write out as a CSV and save to location we've identified
)).to_csv(
ndvi_stats_path_veg,
## Append mode, if CSV already exists, add to it, don't overwrite it
mode = 'a',
## Get rid of row numbers
index = False,
## Write column names if the CSV doesn't exist yet
header = (not os.path.exists(ndvi_stats_path_veg))
)
### Re-load results from file
## Check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
| tract | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|
| 0 | 8031000102 | 5321205 | 0.315051 | 208.435907 | 0.144725 |
| 1 | 8031000201 | 5804883 | 0.189782 | 108.165047 | 0.099341 |
| 2 | 8031000202 | 4920357 | 0.232858 | 117.814190 | 0.158857 |
| 3 | 8031000301 | 5354592 | 0.298791 | 176.550651 | 0.198788 |
| 4 | 8031000302 | 4011626 | 0.297961 | 158.760393 | 0.216279 |
| ... | ... | ... | ... | ... | ... |
| 331 | 8031015300 | 3194630 | 0.288895 | 178.755375 | 0.114123 |
| 332 | 8031015400 | 8819672 | 0.308673 | 216.132026 | 0.137208 |
| 333 | 8031015500 | 2592303 | 0.130041 | 144.742808 | 0.091326 |
| 334 | 8031015600 | 10378313 | 0.144731 | 79.930928 | 0.071562 |
| 335 | 8031015700 | 5831357 | 0.274495 | 169.491847 | 0.143654 |
336 rows × 5 columns
STEP 4 - Explore your data with plots¶
Choropleth plots¶
Before running any statistical models on your data, you should check that your download worked. You should see differences in the asthma rates and mean NDVI across the City.
Create a plot that contains:
- 2 side-by-side choropleth plots
- Asthma prevalence on one plot and mean NDVI on the other
- Make sure to include a title and labeled color bars
### Merge census data with geometry
den_ndvi_cdc_gdf = (
## Grab census tract geometry
tract_cdc_gdf
## Merge with NDVI dataframe
.merge(
ndvi_stats_df,
## match on the tract ID
## left: tract2010
## right: tract
left_on = 'tract2010',
right_on = 'tract',
how = 'inner' ## keep only tracts that appear in both data sets
)
)
## check it out
den_ndvi_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract_x | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | tract_y | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0820000 | 8031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... | 2023 | 8031000102 | 10.2 | 9.1 | 11.4 | % | 3622 | 3017 | 8031000102 | 5321205 | 0.315051 | 208.435907 | 0.144725 |
| 1 | 0820000 | 8031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... | 2023 | 8031000102 | 10.2 | 9.1 | 11.4 | % | 3622 | 3017 | 8031000102 | 5321205 | 0.315051 | 208.435907 | 0.144725 |
| 2 | 0820000 | 8031000102 | 08 | Denver | 0820000-08031000102 | 3109 | POLYGON ((-11691351.798 4834636.885, -11691351... | 2023 | 8031000102 | 10.2 | 9.1 | 11.4 | % | 3622 | 3017 | 8031000102 | 5321205 | 0.315051 | 208.435907 | 0.144725 |
| 3 | 0820000 | 8031000201 | 08 | Denver | 0820000-08031000201 | 3874 | POLYGON ((-11688301.532 4835632.272, -11688302... | 2023 | 8031000201 | 9.8 | 8.6 | 10.9 | % | 3913 | 3222 | 8031000201 | 5804883 | 0.189782 | 108.165047 | 0.099341 |
| 4 | 0820000 | 8031000201 | 08 | Denver | 0820000-08031000201 | 3874 | POLYGON ((-11688301.532 4835632.272, -11688302... | 2023 | 8031000201 | 9.8 | 8.6 | 10.9 | % | 3913 | 3222 | 8031000201 | 5804883 | 0.189782 | 108.165047 | 0.099341 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 331 | 0820000 | 8031015600 | 08 | Denver | 0820000-08031015600 | 6498 | POLYGON ((-11688537.194 4817797.736, -11688985... | 2023 | 8031015600 | 9.8 | 8.7 | 10.8 | % | 6618 | 4813 | 8031015600 | 10378313 | 0.144731 | 79.930928 | 0.071562 |
| 332 | 0820000 | 8031015600 | 08 | Denver | 0820000-08031015600 | 6498 | POLYGON ((-11688537.194 4817797.736, -11688985... | 2023 | 8031015600 | 9.8 | 8.7 | 10.8 | % | 6618 | 4813 | 8031015600 | 10378313 | 0.144731 | 79.930928 | 0.071562 |
| 333 | 0820000 | 8031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... | 2023 | 8031015700 | 10.1 | 9.0 | 11.2 | % | 5907 | 4625 | 8031015700 | 5831357 | 0.274495 | 169.491847 | 0.143654 |
| 334 | 0820000 | 8031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... | 2023 | 8031015700 | 10.1 | 9.0 | 11.2 | % | 5907 | 4625 | 8031015700 | 5831357 | 0.274495 | 169.491847 | 0.143654 |
| 335 | 0820000 | 8031015700 | 08 | Denver | 0820000-08031015700 | 5535 | POLYGON ((-11691342.668 4817773.296, -11691342... | 2023 | 8031015700 | 10.1 | 9.0 | 11.2 | % | 5907 | 4625 | 8031015700 | 5831357 | 0.274495 | 169.491847 | 0.143654 |
336 rows × 20 columns
### Plot choropleths with vegetation statistics
## Make a function - plot_chloropleth
def plot_chloropleth(gdf, **opts):
## docstring to describe what the function is doing
"""Generate a choropleth with the given color column"""
## Use geoviews to make a polygon object
return gv.Polygons(
## Use mercator projection for map
gdf.to_crs(ccrs.Mercator()),
## Set plot to display in mercator as well
crs = ccrs.Mercator()
## Drop x and y axis and add a legend
).opts(xaxis = None, yaxis = None, colorbar=True, **opts)
## Check that function was created
plot_chloropleth
<function __main__.plot_chloropleth(gdf, **opts)>
## Apply the function to the Denver data
(
plot_chloropleth(
## Plot asthma by census tract
den_ndvi_cdc_gdf, color = 'asthma', cmap = 'viridis', title = 'Asthma Prevalence across Denver', colorbar_opts = {'title':'Percent Asthma Rate'})
+
## add a second plot with vegetation edge density
plot_chloropleth(
den_ndvi_cdc_gdf, color = 'edge_density', cmap = 'Greens', title = 'Vegetation Edge Density across Denver', colorbar_opts = {'title':'Edge Density'})
)
Do you see any similarities in your plots? Do you think there is a relationship between adult asthma and any of your vegetation statistics in Denver? Relate your visualization to the research you have done (the context of your analysis) if applicable.
In the plots above, we can see high rates of asthma in northwest Denver and in a specific tract in north-central Denver. Just looking at these maps, vegetation also appears limited in these regions. This suggests there could be an association between these two variables (more so than studies looking at Chicago, for example, have found). As mentioned previously, the high asthma rates in the northeast could also be associated with heavy industry near Commerce City.
STEP 5: Explore a linear ordinary least-squares regression¶
Model description¶
One way to find if there is a statistically significant relationship between asthma prevalence and greenspace metrics is to run a linear ordinary least squares (OLS) regression and measure how well it is able to predict asthma given your chosen fragmentation statistics.
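As a minimal illustration of what OLS does and how a fit metric like R² evaluates it, here is a sketch on synthetic data (plain NumPy, rather than the scikit-learn workflow used later):

```python
import numpy as np

# Synthetic data with a known linear signal plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

# Fit y = b1*x + b0 by ordinary least squares
b1, b0 = np.polyfit(x, y, 1)

# R^2: fraction of the variance in y explained by the fit
pred = b1 * x + b0
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(b1, 2), round(b0, 2), round(r2, 2))
```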
Before fitting an OLS regression, you should check that your data are appropriate for the model.
Write a model description for the linear ordinary least-squares regression that touches on:
- Assumptions made about the data
- What is the objective of this model? What metrics could you use to evaluate the fit?
- Advantages and potential problems with choosing this model.
For our model, we predict asthma prevalence using an OLS regression on vegetation variables (edge density, patch size, and fraction of vegetation). We assume there is some association between vegetation and asthma rates, ideally that more vegetation corresponds to lower asthma rates, and we use the OLS regression to test whether this association holds. The model assumes that asthma rates depend on the vegetation variables alone, which could be problematic because other variables (like urbanization and industrialization) may also play a role.
Step 5a - Data preparation¶
When fitting statistical models, you should make sure that your data meet the model assumptions through a process of selection and/or transformation.
You can select data by:
- Eliminating observations (rows) or variables (columns) that are missing data
- Selecting a model that matches the way in which variables are related to each other (for example, linear models are not good at modeling circles)
- Selecting variables that explain the largest amount of variability in the dependent variable.
You can transform data by:
- Transforming a variable so that it follows a normal distribution. The `log` transform is the most common for eliminating excessive skew (e.g. making the data symmetrical), but you should select the transform most suited to your data.
- Normalizing or standardizing variables to, for example, eliminate negative numbers or effects caused by variables being in different ranges.
- Performing a principal component analysis (PCA) to eliminate multicollinearity among the predictor variables
Tip
Keep in mind that data transforms like a log transform or a PCA must be reversed after modeling for the results to be meaningful.
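For example, if a rate is log-transformed before modeling, predictions come back in log units and must be exponentiated to be interpretable in the original units:

```python
import numpy as np

# Hypothetical skewed, strictly positive rates
rates = np.array([8.5, 9.7, 10.2, 13.7, 21.0])

# Forward transform for modeling...
log_rates = np.log(rates)

# ...and the inverse transform after prediction
recovered = np.exp(log_rates)
print(np.allclose(recovered, rates))  # True
```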
- Use the `hvplot.scatter_matrix()` function to create an exploratory plot of your data.
- Make any necessary adjustments to your data to make sure that they meet the assumptions of a linear OLS regression.
- Check if there are `NaN` values, and if so drop those rows and/or columns. You can use the `.dropna()` method to drop rows with `NaN` values.
- Explain any data transformations or selections you made and why
## Are there any columns with missing values?
den_ndvi_cdc_gdf.isna().any()
## Are there any rows with missing values?
den_ndvi_cdc_gdf.isna().any(axis = 1)
## Are there any missing data in rows OR columns?
den_ndvi_cdc_gdf.isna().any().any()
## How many missing values are in columns?
den_ndvi_cdc_gdf.isna().sum()
## True = there is missing data
place2010          0
tract2010          0
ST                 0
PlaceName          0
plctract10         0
PlcTrPop10         0
geometry           0
year               0
tract_x            0
asthma             0
asthma_ci_low      0
asthma_ci_high     0
data_value_unit    0
totalpopulation    0
totalpop18plus     0
tract_y            0
total_pixels       0
frac_veg           0
mean_patch_size    0
edge_density       0
dtype: int64
### Visualize distribution of data
## Select variables
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']
## Make a scatterplot
hvplot.scatter_matrix(
den_ndvi_cdc_gdf
[variables]
)
## Histograms
## Make facet
fig, axes = plt.subplots(2, 2, figsize = (12,10))
## list variables to plot
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']
titles = ['Vegetated fraction', 'Adult Asthma Rate', 'Mean Patch Size', 'Edge Density']
## Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
ax = axes[i//2, i%2]
ax.hist(den_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
ax.set_title(title)
ax.set_xlabel(var)
ax.set_ylabel('Frequency')
## Adjust layers to prevent overlap
plt.tight_layout()
plt.show()
## Drop missing observations
model_df = (
den_ndvi_cdc_gdf
## Make a copy of the dataframe
.copy()
## Subset to columns
[['frac_veg', 'asthma', 'mean_patch_size', 'edge_density', 'geometry']]
## Drop rows (observations) with missing values
.dropna()
)
model_df
| frac_veg | asthma | mean_patch_size | edge_density | geometry | |
|---|---|---|---|---|---|
| 0 | 0.315051 | 10.2 | 208.435907 | 0.144725 | POLYGON ((-11691351.798 4834636.885, -11691351... |
| 1 | 0.315051 | 10.2 | 208.435907 | 0.144725 | POLYGON ((-11691351.798 4834636.885, -11691351... |
| 2 | 0.315051 | 10.2 | 208.435907 | 0.144725 | POLYGON ((-11691351.798 4834636.885, -11691351... |
| 3 | 0.189782 | 9.8 | 108.165047 | 0.099341 | POLYGON ((-11688301.532 4835632.272, -11688302... |
| 4 | 0.189782 | 9.8 | 108.165047 | 0.099341 | POLYGON ((-11688301.532 4835632.272, -11688302... |
| ... | ... | ... | ... | ... | ... |
| 331 | 0.144731 | 9.8 | 79.930928 | 0.071562 | POLYGON ((-11688537.194 4817797.736, -11688985... |
| 332 | 0.144731 | 9.8 | 79.930928 | 0.071562 | POLYGON ((-11688537.194 4817797.736, -11688985... |
| 333 | 0.274495 | 10.1 | 169.491847 | 0.143654 | POLYGON ((-11691342.668 4817773.296, -11691342... |
| 334 | 0.274495 | 10.1 | 169.491847 | 0.143654 | POLYGON ((-11691342.668 4817773.296, -11691342... |
| 335 | 0.274495 | 10.1 | 169.491847 | 0.143654 | POLYGON ((-11691342.668 4817773.296, -11691342... |
336 rows × 5 columns
### Perform variable transformation
## Log of asthma rate
model_df['log_asthma'] = np.log(model_df.asthma)
## Log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)
### Visualize transformed variables
hvplot.scatter_matrix(
model_df
[[
'frac_veg',
'edge_density',
'log_patch',
'log_asthma'
]]
)
## Q-Q plots
## Set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_asthma']
## Make Q-Q plot for each variable
plt.figure(figsize = (12,10))
for i, var in enumerate(var_qq, 1):
## Make 2 x 2 facet
plt.subplot(2, 2, i)
## Norm distribution
stats.probplot(model_df[var], dist = "norm", plot = plt)
## Show the plot
plt.tight_layout()
plt.show()
Reflect and respond: EXPLAIN YOUR SELECTION AND TRANSFORMATION PROCESS HERE
These Q-Q plots show that the data follow a normal distribution fairly well. The lowest and highest ordered values deviate from normality the most.
Step 5b - Fit and Predict¶
If you have worked with statistical models before, you may notice that the scikit-learn library takes a slightly different approach than many software packages. For example, scikit-learn emphasizes generic model performance measures like cross-validation and feature importance over coefficient p-values and correlation. The scikit-learn approach is meant to generalize more smoothly to machine learning (ML) models, where statistical significance is harder to derive mathematically.
- Using the scikit-learn documentation or ChatGPT as a starting point, split your data into training and testing datasets.
- Fit a linear regression to your training data.
- Use your fitted model to predict the testing values.
- Plot the predicted values against the measured values. You can use the following plotting code as a starting point.
### Select predictor and outcome variables
X = model_df[['edge_density', 'log_patch']] # Explanatory variables
y = model_df[['log_asthma']]
### Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.33, random_state = 19 # Make sure data split is reproducible, "seed"
)
### Fit a linear regression
reg = LinearRegression()
## Fit to our data
reg.fit(X_train, y_train)
### Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(X_test)) # Invert the log transform; store predictions in a new column
## Real asthma rate for comparison
y_test['asthma'] = np.exp(y_test.log_asthma)
## Check it out
y_test
| log_asthma | pred_asthma | asthma | |
|---|---|---|---|
| 92 | 2.322388 | 10.479209 | 10.2 |
| 293 | 2.351375 | 10.547584 | 10.5 |
| 27 | 2.617396 | 10.755577 | 13.7 |
| 143 | 2.341806 | 10.280625 | 10.4 |
| 158 | 2.312535 | 10.290222 | 10.1 |
| ... | ... | ... | ... |
| 315 | 2.272126 | 10.176164 | 9.7 |
| 16 | 2.341806 | 10.329474 | 10.4 |
| 237 | 2.312535 | 10.337080 | 10.1 |
| 128 | 2.292535 | 10.161204 | 9.9 |
| 137 | 2.379546 | 10.270777 | 10.8 |
111 rows × 3 columns
### Plot measured vs. predicted asthma prevalence with a 1-to-1 line
## Scale y-axis
y_max = y_test.asthma.max()
(
y_test
.hvplot.scatter(x='asthma',
y='pred_asthma',
xlabel = "Measured adult asthma prevalence",
ylabel = "Predicted adult asthma prevalence",
title = "Linear Regression Performance - Testing Data"
)
.opts(aspect='equal',
xlim=(0, y_max), ylim=(0, y_max),
width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
## Each observation is a different census tract
## Not many points follow the line
## Every model has errors
Step 5c - Explore spatial bias¶
We always need to think about bias, or systematic error, in model results. Every model is going to have some error, but we’d like to see that error evenly distributed. When the error is systematic, it can be an indication that we are missing something important in the model.
In geographic data, it is common for location to be a factor that doesn’t get incorporated into models. After all – we generally expect places that are right next to each other to be more similar than places that are far away (this phenomenon is known as spatial autocorrelation). However, models like this linear regression don’t take location into account at all.
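A toy illustration of spatial autocorrelation in residuals (hypothetical 1-D "map" of errors): if each location's error is strongly correlated with the mean error of its neighbors, the residuals are spatially structured rather than random noise.

```python
import numpy as np

# Hypothetical model errors along a 1-D transect with a smooth spatial trend
errors = np.array([3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -4.0])

# Spatial lag: mean of each interior location's two immediate neighbors
lag = np.convolve(errors, [0.5, 0.0, 0.5], mode='valid')

# Correlation between each value and its neighbors' mean; values near 1
# indicate spatially structured (biased) error
r = np.corrcoef(errors[1:-1], lag)[0, 1]
print(round(r, 2))
```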
- Compute the model error (predicted - measured) for all census tracts
- Plot the error as a choropleth map with a diverging color scheme
- Looking at both of your error plots, what do you notice? What are some possible explanations for any bias you see in your model?
### Compute model error for all census tracts
## Use the trained model to predict asthma prevalence for each census tract
model_df['pred_asthma'] = np.exp(reg.predict(X))
## Calculate model error
model_df['error_asthma'] = model_df['pred_asthma'] - model_df['asthma']
### Plot error geographically as a choropleth
(
plot_chloropleth(model_df, color = 'error_asthma', cmap = 'RdBu')
## Set frame width and aspect ratio
.opts(frame_width = 600, aspect = 'equal')
)
## Add titles, legends, etc.
What do you notice about your model from looking at the error plots? What additional data, transformations, or model type might help improve your results?
Overall, this model does a reasonable job of revealing an association between asthma and vegetation levels in Denver. Controlling for factors such as industrial pollution, age, income, or other demographics could improve the model considerably. A regression with additional independent variables that may affect asthma rates would likely provide more accurate results.